Statistical EDA and Modeling (Updated)

Initial Analysis

Here we examined the distribution of games played per player. Based on this, we decided that a 20-game minimum was the best cutoff for the analysis.
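As a minimal sketch of this cutoff (in pure Python with made-up player records — the report's own analysis appears to be in R), the filter looks like:

```python
# Hypothetical player records; the field names are assumptions, not the
# actual dataset schema.
players = [
    {"name": "A", "games_played": 82},
    {"name": "B", "games_played": 12},
    {"name": "C", "games_played": 45},
]

MIN_GAMES = 20
# Keep only players who meet the 20-game minimum used throughout the report.
eligible = [p for p in players if p["games_played"] >= MIN_GAMES]
print([p["name"] for p in eligible])
```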

PCA

We ran Principal Component Analysis (PCA) to see whether any variables stand out more than others.

The elbow (scree) plot shows that not many components are really needed. Note that we removed players on an “Entry Level Contract” and kept only players with at least 20 games played.

We then looked at which variables carried the most weight (loading) in the first two components, since this could guide our dimension-reduction choices.
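To illustrate the idea behind the scree plot and the loadings, here is a self-contained two-variable PCA sketch in pure Python (toy data, not the hockey dataset). For a 2×2 covariance matrix the eigendecomposition has a closed form, which gives the explained-variance ratio behind the elbow plot and the PC1 loadings:

```python
import math

# Toy data: two strongly correlated variables, so PC1 dominates.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Sample covariance matrix entries
sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
syy = sum((yi - my) ** 2 for yi in y) / (n - 1)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Eigenvalues of the 2x2 covariance matrix (closed form)
mean_diag = (sxx + syy) / 2
gap = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
lam1, lam2 = mean_diag + gap, mean_diag - gap

# Explained-variance ratio of PC1 -- the quantity a scree plot displays
ratio1 = lam1 / (lam1 + lam2)
print(f"PC1 explains {ratio1:.1%} of the variance")

# Loadings (eigenvector) for PC1: which variable weighs most on the component
v = (sxy, lam1 - sxx)
norm = math.hypot(*v)
loadings = (v[0] / norm, v[1] / norm)
print("PC1 loadings:", loadings)
```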

Linear Regression Model

Initial Modeling

Next, we fit a linear regression model to see which coefficients are the largest.
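The model in the report has many predictors; as a self-contained stand-in, here is closed-form ordinary least squares with a single hypothetical predictor (variable names and values are made up):

```python
# Hypothetical predictor and response; the real model has many predictors.
games = [20, 35, 50, 65, 80]          # games played (toy values)
salary = [1.0, 1.8, 2.4, 3.3, 4.1]    # response, in $M (toy values)

n = len(games)
mx = sum(games) / n
my = sum(salary) / n
# Closed-form OLS for one predictor: slope = cov(x, y) / var(x)
slope = sum((x - mx) * (y - my) for x, y in zip(games, salary)) / sum(
    (x - mx) ** 2 for x in games
)
intercept = my - slope * mx
print(f"coef = {slope:.4f}, intercept = {intercept:.4f}")
```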

Correlation Matrix

Here we examine how the variables are correlated with one another.
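The computation behind a correlation matrix is pairwise Pearson correlation; a pure-Python sketch on toy columns (the column names are assumptions, not the real data):

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(
        sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)
    )

# Toy columns: goals and assists move together; shots here moves opposite.
goals = [10, 20, 30, 40]
assists = [12, 18, 33, 41]
shots = [90, 80, 60, 55]

cols = {"goals": goals, "assists": assists, "shots": shots}
# Print the pairwise correlation matrix row by row
for name_a, col_a in cols.items():
    row = {name_b: round(pearson(col_a, col_b), 2) for name_b, col_b in cols.items()}
    print(name_a, row)
```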

Top Regression Model

Here we compared the RMSE of several regression models to determine which performed best.
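The comparison metric itself is simple to state; here is RMSE on toy predictions from two hypothetical models (the numbers are invented, not results from the report):

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error between actual and predicted values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

actual = [2.0, 3.5, 1.0, 4.0]
model_a = [2.1, 3.4, 1.2, 3.8]  # hypothetical model A predictions
model_b = [2.5, 3.0, 0.5, 4.5]  # hypothetical model B predictions

print("model A RMSE:", round(rmse(actual, model_a), 4))
print("model B RMSE:", round(rmse(actual, model_b), 4))
# The model with the lower RMSE wins the comparison.
```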

Lasso/Ridge Regression

Here is the ridge coefficient-path plot.

And the corresponding lasso plot.

Next, we compared all the models. We took the linear regression model and compared it against the elastic-net family, starting at ridge (alpha = 0) and increasing alpha in steps of 0.25 up to 1 (which is the lasso model). The goal was to identify which model was best.
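The alpha grid above interpolates between the ridge and lasso penalties; a sketch of the elastic-net penalty in glmnet's parameterization (the coefficient values are made up):

```python
def elastic_net_penalty(coefs, alpha, lam=1.0):
    """Elastic-net penalty: lam * (alpha * L1 + (1 - alpha)/2 * L2^2).
    alpha = 0 gives ridge, alpha = 1 gives lasso (glmnet's convention)."""
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2)

coefs = [0.5, -1.2, 0.0, 2.0]  # hypothetical fitted coefficients
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:  # the grid described above
    print(f"alpha = {alpha:.2f} -> penalty = {elastic_net_penalty(coefs, alpha):.3f}")
```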

Random Forest

Here we plot the 100 most important variables from a random forest, which we use as a variable-selection (dimension-reduction) technique.

We did the same thing, subsetting the data to forwards only.

And lastly, we ran the same random forest on the defensemen subset.
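The report's importance plots come from the random forest itself; as a self-contained stand-in for how variable importance can be measured, this sketches *permutation* importance: shuffle one feature and see how much a fixed model's error grows. The data and the toy "model" are entirely made up:

```python
import math
import random

random.seed(0)

# Toy data: the response depends strongly on feature 0 and not at all on feature 1.
X = [[random.uniform(0, 1), random.uniform(0, 1)] for _ in range(200)]
y = [3 * row[0] + random.gauss(0, 0.05) for row in X]

def predict(row):
    # A fixed toy model that matches the data-generating process.
    return 3 * row[0]

def model_rmse(X, y):
    return math.sqrt(sum((predict(r) - t) ** 2 for r, t in zip(X, y)) / len(y))

baseline = model_rmse(X, y)
importances = []
for j in range(2):
    # Shuffle column j, breaking its relationship with the response.
    shuffled_col = [row[j] for row in X]
    random.shuffle(shuffled_col)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, shuffled_col)]
    imp = model_rmse(X_perm, y) - baseline
    importances.append(imp)
    print(f"feature {j}: importance = {imp:.3f}")
```

A feature whose shuffling barely changes the error (feature 1 here) is unimportant; the same logic ranks the hundreds of hockey variables in the plots above.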

Trying to find the most important variables

Using these different plots, we looked for variables that were consistently important across models, to identify the most important variables for our reduction.

We first looked at all the data, without subsetting by position.

## # A tibble: 74 × 2
##    value                           n
##    <chr>                       <int>
##  1 age                             3
##  2 all_assists                     3
##  3 all_diff_penalty_minutes        3
##  4 all_i_f_giveaways               3
##  5 all_i_f_goals                   3
##  6 all_i_f_shots_on_goal           3
##  7 all_shots_blocked_by_player     3
##  8 diff_5on5_high_danger_goals     3
##  9 diff_all_high_danger_shots      3
## 10 games_played                    3
## # … with 64 more rows

We then did the same for forwards.

## # A tibble: 71 × 2
##    value                           n
##    <chr>                       <int>
##  1 age                             3
##  2 all_assists                     3
##  3 all_diff_penalty_minutes        3
##  4 all_faceoffs_lost               3
##  5 all_i_f_giveaways               3
##  6 all_i_f_hits                    3
##  7 all_i_f_shots_on_goal           3
##  8 diff_4on5_corsi_percentage      3
##  9 diff_5on5_high_danger_goals     3
## 10 diff_all_corsi_percentage       3
## # … with 61 more rows

And lastly, we looked at the defensive players.

## # A tibble: 74 × 2
##    value                           n
##    <chr>                       <int>
##  1 age                             3
##  2 all_assists                     3
##  3 all_i_f_giveaways               3
##  4 all_i_f_hits                    3
##  5 all_i_f_shots_on_goal           3
##  6 diff_4on5_corsi_percentage      3
##  7 diff_5on5_goals                 3
##  8 diff_5on5_high_danger_goals     3
##  9 diff_5on5_high_danger_shots     3
## 10 diff_all_corsi_percentage       3
## # … with 64 more rows
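The tibbles above tally, for each variable, how many of the importance plots it appeared in (`n` = 3 means all three). A sketch of that cross-plot tally, using short toy variable lists in place of the real ones:

```python
from collections import Counter

# Toy stand-ins for the variables surfaced by each importance plot.
all_data = ["age", "all_assists", "all_i_f_goals"]
forwards = ["age", "all_assists", "all_faceoffs_lost"]
defense = ["age", "all_assists", "all_i_f_hits"]

counts = Counter(all_data + forwards + defense)
# Variables with n == 3 appeared in every plot, like the tibble rows above.
consistent = sorted(v for v, n in counts.items() if n == 3)
print(consistent)
```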

Using the random forest, we can also see what the model suggests as the optimal number of predictors to use.